{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/data_connectors/lagecy_office_reader.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Legacy Office Reader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `LegacyOfficeReader` is the reader for Word-97(.doc) files. Under the hood, it uses Apache Tika to parse the file."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Started"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙 and the legacy office reader.\n",
    "\n",
    "> Note: Apache Tika is a dependency of the legacy office reader and it requires Java to be installed and call-able via `java --version`.\n",
    "> \n",
    "> For instance, on colab, you can install it with `!apt-get install default-jdk`. or on macOS, you can install it with `brew install openjdk`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-readers-legacy-office"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prepare Data\n",
    "\n",
    "So we need to prepare a .doc file for testing. Supposedly it's in `test_dir/harry_potter_lagacy.doc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.readers.legacy_office import LegacyOfficeReader"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Option 1**: Load the file with `LegacyOfficeReader`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"./test_dir/harry_potter_lagacy.doc\"\n",
    "reader = LegacyOfficeReader(\n",
    "    excluded_embed_metadata_keys=[\"file_path\", \"file_name\"],\n",
    "    excluded_llm_metadata_keys=[\"file_type\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 1 docs\n"
     ]
    }
   ],
   "source": [
    "docs = reader.load_data(file=file_path)\n",
    "print(f\"Loaded {len(docs)} docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Option 2**: Load the file with `SimpleDirectoryReader`\n",
    "\n",
    "This is the path where we have `.doc` files together with other files in the same directory.\n",
    "\n",
    "```python\n",
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "reader = SimpleDirectoryReader(\n",
    "    input_dir=\"./test_dir/\",\n",
    "    file_extractor={\n",
    "        \".doc\": LegacyOfficeReader(),\n",
    "        }\n",
    ")\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
